ERIC

DOCUMENT RESUME

ED 169 066                                             TM 007 341

AUTHOR         Bejar, Isaac I.
TITLE          Applications of Adaptive Testing in Measuring
               Achievement and Performance.
PUB DATE       [76]
NOTE           8p.

EDRS PRICE     MF01/PC01 Plus Postage.
DESCRIPTORS    *Achievement Tests; Complexity Level; *Confidence
               Testing; Factor Analysis; Scoring Formulas;
               Simulation; *Testing; Test Items; Test Reliability
IDENTIFIERS    Computer Assisted Testing; *Latent Trait Models;
               *Tailored Testing

ABSTRACT 

The concept of testing for partial knowledge is
considered with the concept of tailored testing. Following the
special usage of latent trait theory, the word validity is used to
mean the correlation of a test with the construct the test measures.
The concept of a method factor in the test is also considered as a
part of the validity. The possible effect of scoring for partial
knowledge on such hypothetical tests is considered together with the
logic of these hypotheses. The application of latent trait theory to
a mathematical model is used to provide estimates of the expected
gain in information as a function of the increase in inter-item
correlations. Finally, these concepts are combined with the concepts
of tailored testing. Two aspects of tailored testing are considered,
tailoring test length and tailoring test difficulty. The
possibilities of adapting tailored testing to non-dichotomous item
scoring are considered in order to adapt tailored testing to the use
of partial knowledge in the test score. (TM)



* Reproductions supplied by EDRS are the best that can be made *
*                  from the original document.                 *






Applications of Adaptive Testing in
Measuring Achievement and Performance

by
Isaac I. Bejar
University of Minnesota



U.S. DEPARTMENT OF HEALTH,
EDUCATION & WELFARE
NATIONAL INSTITUTE OF
EDUCATION

THIS DOCUMENT HAS BEEN REPRODUCED
EXACTLY AS RECEIVED FROM
THE PERSON OR ORGANIZATION ORIGINATING
IT. POINTS OF VIEW OR OPINIONS
STATED DO NOT NECESSARILY REPRESENT
OFFICIAL NATIONAL INSTITUTE OF
EDUCATION POSITION OR POLICY.



Introduction

Achievement testing consists of locating individuals on an
achievement scale. Usually, to interpret achievement test scores,
a transformation is applied to the scores which allows an
interpretation in terms of the relative standing of an individual with
respect to the norming group. In many instructional settings, this
interpretation is not adequate and, as a result, a demand for more
concrete kinds of interpretation has emerged. The frequency with
which criterion-referenced testing, mastery testing and similar
approaches are used is evidence that the suggestion has been
welcomed by test users.



What is unique about these testing procedures is that the
items that constitute the test are sampled from a population of items
which is isomorphic with the objectives of the instructional program
on which we want to measure achievement (Shoemaker, 1975). Because
of this, it is possible to interpret scores in terms of what the
student can do in relation to the objectives of the instructional
program.

Undoubtedly, this attention to content is bound to increase the
quality of test scores. Today I'd like to describe our efforts at
the University of Minnesota to improve achievement testing in
general, including criterion-referenced testing approaches, by means
of more refined response procedures as well as by adapting the test
to the individual.






Background 

Most psychometric theory assumes dichotomous scoring; that is,
responses are classified as either correct or incorrect. However,
knowledge is seldom binary, and by proceeding as if it were, partial
knowledge is not given due recognition. If, in fact, partial
information is present, then extracting it should lead to more valid and
reliable scores.



The research literature, however, does not support the last
statement. The results of the typical investigation show that while
reliability is usually increased by taking partial knowledge into
account, the validity of the scores remains the same or even
diminishes. Such findings are usually interpreted as evidence
against the usefulness of the assessment of partial knowledge. To
me, they indicate that something is amiss, for example, that the
test and the criterion are not unidimensional.

This research is supported by Contract N00014-76-C-0627, NR 150-389,
with the Personnel and Training Research Programs, Office of Naval
Research, David J. Weiss, Principal Investigator.

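To illustrate, consider two tests, A and B, measuring a single
construct. Both A and B correlate .60 with the construct. This
can be summarized as follows:

$$
\begin{array}{lc}
 & \text{Construct} \\
\text{Test A} & .60 \\
\text{Test B} & .60
\end{array}
\tag{1}
$$

The correlation matrix of A and B can then be obtained from the
factor model in Equation 2,

$$
R = \Lambda \Lambda' + U^2 \tag{2}
$$

where $\Lambda$ is the matrix of correlations of the tests with the
factors, as in Equation 1, and $U^2$ is the diagonal matrix of unique
variances. In this case Equation 2 becomes Equation 3:

$$
R =
\begin{bmatrix} .60 \\ .60 \end{bmatrix}
\begin{bmatrix} .60 & .60 \end{bmatrix}
+ \begin{bmatrix} .64 & .00 \\ .00 & .64 \end{bmatrix}
= \begin{bmatrix} .36 & .36 \\ .36 & .36 \end{bmatrix}
+ \begin{bmatrix} .64 & .00 \\ .00 & .64 \end{bmatrix}
= \begin{bmatrix} 1.00 & .36 \\ .36 & 1.00 \end{bmatrix}
\tag{3}
$$

If we refer to the off-diagonals of $\Lambda \Lambda'$ as validities and to the
diagonals as reliabilities, in this case both A and B have a
reliability equal to .36 and a validity of .36. Now suppose Test A is
administered under conditions that allow for partial knowledge and
that, as a result, its correlation with the construct goes from .60
to .70. Following the same procedure, we now find that the
reliability of A is .49 while that of B remains at .36, and that the
correlation (validity) has gone up from .36 to .42. In short, when there
is a common factor between two measures, an increase in the
reliability of one of them will lead to an increase in validity. This is
not so when more than one factor is common.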



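To illustrate this, assume that Tests A and B, both administered
conventionally, have in common, in addition to the construct, a
method factor, and that both correlate .40 with it. That is,

$$
\begin{array}{lcc}
 & \text{Construct} & \text{Method} \\
\text{Test A} & .60 & .40 \\
\text{Test B} & .60 & .40
\end{array}
\tag{4}
$$

Assuming that the construct and the method factor are uncorrelated
in the population, the correlation matrix for A and B, according to
the model in Equation 2, is given by

$$
\begin{bmatrix} .60 & .40 \\ .60 & .40 \end{bmatrix}
\begin{bmatrix} .60 & .60 \\ .40 & .40 \end{bmatrix}
+ \begin{bmatrix} .48 & .00 \\ .00 & .48 \end{bmatrix}
= \begin{bmatrix} .52 & .52 \\ .52 & .52 \end{bmatrix}
+ \begin{bmatrix} .48 & .00 \\ .00 & .48 \end{bmatrix}
= \begin{bmatrix} 1.00 & .52 \\ .52 & 1.00 \end{bmatrix}
\tag{5}
$$

The validity is .52.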



Now suppose that Test A above is again administered under
conditions that allow for the scoring of partial information and that,
as a result of this, its correlation with the construct becomes .70.
At the same time the correlation of Test A with the method factor
drops from .40 to .20; i.e., the correlations with the two factors become

$$
\begin{array}{lcc}
 & \text{Construct} & \text{Method} \\
\text{Test A (with partial knowledge)} & .70 & .20 \\
\text{Test B} & .60 & .40
\end{array}
\tag{6}
$$

and the matrix of reliabilities (diagonal) and validities
(off-diagonal) becomes

$$
\begin{bmatrix} .53 & .50 \\ .50 & .52 \end{bmatrix}
\tag{7}
$$

Thus, as a result of introducing partial knowledge, the validity was
reduced from .52 to .50. However, it is clear that this seemingly
disappointing result is not inconsistent with the true improvement
that occurred, namely an increase of the correlation with the
construct.
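
The arithmetic of the one-factor and two-factor examples can be checked with a short sketch. It assumes the factor model of Equation 2 takes the usual form in which the reliabilities and validities are the diagonal and off-diagonal entries of the loading matrix multiplied by its transpose. The function names and the dictionary of cases are illustrative and do not come from the original talk.

```python
# Sketch: recover reliabilities and validities from the loadings alone.
# All names here are illustrative assumptions, not the author's notation.

def common_part(loadings):
    """Multiply the loading matrix by its transpose (rows are tests)."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in loadings]
            for row_i in loadings]

def summarize(loadings):
    """Return (reliability of A, reliability of B, validity), rounded to 2 places."""
    m = common_part(loadings)
    return round(m[0][0], 2), round(m[1][1], 2), round(m[0][1], 2)

# Each row lists a test's correlations with the construct and, where
# present, with the method factor.
cases = {
    "one factor, conventional":       [[0.60], [0.60]],
    "one factor, partial knowledge":  [[0.70], [0.60]],
    "two factors, conventional":      [[0.60, 0.40], [0.60, 0.40]],
    "two factors, partial knowledge": [[0.70, 0.20], [0.60, 0.40]],
}

for label, loadings in cases.items():
    rel_a, rel_b, validity = summarize(loadings)
    print(f"{label}: reliability A={rel_a}, B={rel_b}, validity={validity}")
```

With one common factor the validity rises when Test A improves; with a second, method factor the same improvement in Test A lowers the validity, because part of the original correlation between A and B was due to the shared method.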



Although this example contains many assumptions, it seems that
something similar occurs with real data. Hakstian and Kansup (1975)
compared the validity of a verbal ability test administered under
conventional and elimination scoring (Coombs, Millholland & Womer,
1956) instructions. Validity was defined as correlation with
school grades in language arts. This correlation was .49 under
conventional administration and .39 under elimination scoring.
However, the correlation with another verbal ability test was .59
under conventional scoring and .[?]1 under elimination scoring. Thus,
defining validity as the correlation with school grades, elimination
scoring appears to be less valid; but defined as the correlation
with another verbal ability score, elimination scoring is more valid.
These two findings are not contradictory but simply provide evidence
of the fact that school grades and test scores are not unidimensional.

Advantages of Using Partial Knowledge

In short, I think a critical review of the literature will
convince most that the question is not whether partial knowledge
scoring improves the validity and reliability of test scores but
rather under what conditions gains are to be expected, and how large
those gains are likely to be, in particular whether they are large
enough to offset any increase in testing time. It stands to reason
that if methods for the assessment of partial knowledge are to yield
improved test scores, the tests must be such that there will be an
opportunity for partial knowledge to emerge. With few exceptions,
most notably Coombs et al., the presence of partial knowledge is
never tested. Some theoretical results suggest that when partial
knowledge is allowed to emerge and it is scored, dramatic
improvements in test scores follow.

To illustrate this, I computed the information functions of two
latent trait models. (You will recall that information at a given
point on the underlying trait is the reciprocal of the variance of
the maximum likelihood estimate at that point. Therefore the larger
the information value, the more precise our estimate of the location
of an individual on the trait.) One of the models uses the
two-parameter normal ogive, which is appropriate for dichotomous scoring.
The other model was Samejima's (1969) graded response model, which
is an extension of the two-parameter normal ogive to polychotomous
scoring. You may think of the information of the graded model as
the case when partial knowledge is taken into account, whereas the
information provided by the dichotomous model is that provided when
partial information is ignored.

To simplify the comparison, I computed for each model the mean
information assuming that the underlying trait was normally
distributed. The ratio of the mean information for the graded model over
that of the dichotomous model for several levels of test homogeneity
is seen in Table 1. For example, at r = .30 the ratio is 1.42. This
means that, on the average, the use of partial knowledge will be 42%
more informative than if it is ignored. Note that this improvement,
due to incorporating partial information into the scores, increases
as the discrimination of the test increases. In other words, the
better the test, the more it will benefit from adding partial
knowledge.



Table 1

Ratio of Mean Information of Graded to
Dichotomous Model, as a Function of Inter-Item Correlation

Inter-item correlation        .30    .40    .50    .60    .70    .[?]0
Ratio of mean information    1.42   1.43   1.48   1.52   1.58   1.90
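
A comparison of this kind can be sketched numerically. The code below is not the computation behind Table 1 and is not expected to reproduce its values: the category boundaries, the number of categories, and the conversion from inter-item correlation r to item discrimination (a common loading of sqrt(r), so a = sqrt(r / (1 - r))) are all assumptions made for illustration. Item information is computed as the sum over categories of the squared slope of the category probability divided by that probability, which reduces to the familiar normal ogive expression when there are only two categories.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def item_information(theta, a, boundaries):
    """Information of one normal ogive item with ordered categories.

    `boundaries` must be ascending. One boundary gives the dichotomous
    two-parameter model; several give a graded response item.
    """
    # Probability of responding at or above each boundary, padded with 1 and 0.
    cum = [1.0] + [Phi(a * (theta - b)) for b in boundaries] + [0.0]
    dcum = [0.0] + [a * phi(a * (theta - b)) for b in boundaries] + [0.0]
    info = 0.0
    for k in range(len(cum) - 1):
        p = cum[k] - cum[k + 1]        # probability of category k
        dp = dcum[k] - dcum[k + 1]     # its derivative with respect to theta
        if p > 1e-12:                  # skip categories with negligible probability
            info += dp * dp / p
    return info

def mean_information(a, boundaries, lo=-4.0, hi=4.0, steps=400):
    """Average information over a standard normal trait (midpoint rule)."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        theta = lo + (i + 0.5) * width
        total += item_information(theta, a, boundaries) * phi(theta) * width
    return total

def discrimination_from_r(r):
    """Assumed link between inter-item correlation and discrimination."""
    return math.sqrt(r / (1.0 - r))

for r in (0.30, 0.50, 0.70):
    a = discrimination_from_r(r)
    graded = mean_information(a, [-1.0, 0.0, 1.0])  # four categories (assumed)
    dichotomous = mean_information(a, [0.0])        # same item scored right/wrong
    print(f"r={r:.2f}  ratio of mean information={graded / dichotomous:.2f}")
```

Because the dichotomous scoring here is a coarsening of the graded scoring (the boundary at 0 is shared), the ratio cannot fall below 1; how far it rises above 1 depends on the assumed boundaries.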

The advantages derived from taking partial knowledge into account
can only materialize under the proper conditions. In the conventional
testing situation, even though partial knowledge influences which
alternative is chosen, the response is scored as correct or incorrect.
One way of allowing credit to be given for partial knowledge is to
instruct testees to segregate alternatives into different categories.
Coombs' procedure is an instance of this approach where the
categories are "correct" and "incorrect". Other categories are possible,
though; for example, verbal items may be classified as synonyms,
antonyms, or neither.

Computerized Testing

Recording and scoring responses to this kind of item is not,
however, convenient with paper and pencil administration. This brings
me to another aspect of our research, namely the use of computers.
One obvious use of computers is to handle the recording and scoring
of responses, but as previous presentations in this symposium suggest,
the computer can also be used to adapt or tailor the test to each
individual.

These presentations, and indeed most of the research in
computerized adaptive testing, are oriented toward ability measurement. In
achievement testing, we should distinguish between two kinds of
tailoring. One is tailoring the length of the test and the other is
tailoring the difficulty of the test.

Tailoring the length of the test is appropriate in instructional
settings where each individual is allowed as much time as necessary
to complete a given unit of instruction. Under those conditions,
individual differences with respect to knowledge are minimized and
it becomes profitable to tailor the test in terms of length rather
than difficulty. The research of Ferguson (1970) is an example of
this type of tailoring. In his system, an individual is tested
until he is classified into a non-mastery or mastery category. The
statistical basis of this system is that of Wald's sequential
likelihood ratio test. Ferguson's model assumes that the difficulty and
discrimination of all items are the same. It is not known how
sensitive the procedure is with respect to violation of these assumptions.
Research addressed to this question is needed. It would also be
desirable to study the possibility of relaxing the model to allow for
unequal item difficulties and discriminations as well as allowing for
polychotomous responses.
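
The logic of testing until a mastery decision is reached can be sketched with Wald's sequential test for a binomial proportion. This is a minimal illustration under the equal-items assumption described above, not Ferguson's system: the function name, the two proportions that define a non-master and a master (.60 and .85), and the error rates are hypothetical values chosen for the example.

```python
import math

def sprt_mastery(responses, p_nonmaster=0.60, p_master=0.85,
                 alpha=0.05, beta=0.05):
    """Classify an examinee with Wald's sequential likelihood ratio test.

    responses: iterable of 1 (correct) or 0 (incorrect), in the order given.
    alpha: tolerated rate of calling a non-master a master.
    beta:  tolerated rate of calling a master a non-master.
    Returns (decision, items_used); decision is "mastery", "non-mastery",
    or "undecided" when the responses run out before a boundary is crossed.
    """
    upper = math.log((1.0 - beta) / alpha)   # decide mastery at or above this
    lower = math.log(beta / (1.0 - alpha))   # decide non-mastery at or below this
    log_ratio = 0.0
    used = 0
    for used, x in enumerate(responses, start=1):
        if x:
            log_ratio += math.log(p_master / p_nonmaster)
        else:
            log_ratio += math.log((1.0 - p_master) / (1.0 - p_nonmaster))
        if log_ratio >= upper:
            return "mastery", used
        if log_ratio <= lower:
            return "non-mastery", used
    return "undecided", used

print(sprt_mastery([1] * 20))          # a run of correct answers
print(sprt_mastery([0] * 20))          # a run of incorrect answers
print(sprt_mastery([1, 0, 1, 0, 1]))   # a short mixed record
```

Every item contributes the same increment for a correct answer and the same decrement for an incorrect one, which is exactly where the assumption of equal difficulty and discrimination enters. Allowing unequal items would mean replacing the two fixed proportions with item-specific probabilities.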






Although self-paced instruction has many advantages, limited
resources often do not permit its full implementation. As a result,
the sample under instruction will likely be heterogeneous with
respect to achievement. Similarly, if we are testing for retention
of achievement or for levels of achievement acquired prior to
instruction, we will also find wide variation in performance. Under
these conditions, tailoring the test to an individual's level of
achievement will be more efficient than the conventional non-adaptive
procedure, as the previous presentations suggest.

One of the major aims of our research is to combine the
advantages of partial knowledge scoring and adaptive testing. Most of
the research on adaptive testing at the University of Minnesota and
elsewhere has been done in the context of dichotomous response models.
The exceptions are to be found in the work of Bayroff & Anderson
(1960), Wood (1971) and Samejima (1975).

Bayroff & Anderson seem to be the only ones to have actually
implemented an adaptive testing strategy using non-dichotomous items.
Essentially what they did was to branch an individual according to
the correctness of the alternative chosen. Although they used a
polychotomous item for the first item only, this can be readily
extended to include all items. Other branching rules are possible.
Wood (1971) suggested that the optimal branching rule will administer
as the next item the most discriminating of those items with a
midpoint of adjacent categories closer to the individual's current
estimate of achievement. Samejima (1975) carried out a simulation on
live data of a similar procedure which she referred to as tailoring
the dichotomization of the item to the individual. She noticed dramatic
improvements by comparing the plot of scores based on a uniform
dichotomization and tailored dichotomization against the scores based
on the polychotomous responses.
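
One possible reading of the branching rule attributed to Wood can be sketched as follows. The rule as stated leaves open how close a midpoint must be, so the `window` parameter, the fallback when no item qualifies, the item pool, and all names below are assumptions made for illustration rather than Wood's actual procedure.

```python
def nearest_midpoint_distance(item, theta):
    """Distance from theta to the closest midpoint of two adjacent boundaries."""
    b = item["boundaries"]
    midpoints = [(b[i] + b[i + 1]) / 2.0 for i in range(len(b) - 1)]
    return min(abs(m - theta) for m in midpoints)

def next_item(pool, theta, window=0.5):
    """Choose the most discriminating item with a midpoint near theta.

    Items whose nearest midpoint lies within `window` of the current
    achievement estimate are candidates; the one with the largest
    discrimination `a` is chosen. If there are no candidates, the item
    with the closest midpoint is returned instead.
    """
    near = [it for it in pool if nearest_midpoint_distance(it, theta) <= window]
    if near:
        return max(near, key=lambda it: it["a"])
    return min(pool, key=lambda it: nearest_midpoint_distance(it, theta))

# Hypothetical pool: each item has a discrimination and three category boundaries.
pool = [
    {"name": "item1", "a": 0.8, "boundaries": [-1.5, -0.5, 0.5]},
    {"name": "item2", "a": 1.4, "boundaries": [-0.5, 0.5, 1.5]},
    {"name": "item3", "a": 1.9, "boundaries": [1.0, 2.0, 3.0]},
]

print(next_item(pool, theta=0.2)["name"])
```

In an adaptive test this selection would be repeated after each response, with the achievement estimate updated in between; the updating step is not shown here.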


Summary 

To summarize, one part of our research is concerned with the
joint implementation of two recent developments in test theory:
adapting the test to the individual and simultaneously extracting
more information from each response by recording partial knowledge.
The question that remains is whether sets of items can be constructed
such that they will allow partial knowledge to be utilized without
unduly increasing testing time. By next year's meeting, I hope to
have the answer to this and other related questions.






REFERENCES

Bayroff, A.G., Thomas, J.J., & Anderson, A.A. Construction
     of an experimental sequential item test. Research Memorandum
     60-1, Personnel Research Branch, Department of the Army,
     January 1960.

Coombs, C.H., Millholland, J.E., & Womer, F.B. The assessment of
     partial knowledge. Educational and Psychological Measurement,
     1956, 16, 17-37.

Ferguson, R.L. A model for computer-assisted criterion-referenced
     measurement. Education, 1970, 91, 25-31.

Hakstian, A.R., & Kansup, W. A comparison of several methods of
     assessing partial knowledge in multiple choice tests: II.
     Testing procedures. Journal of Educational Measurement,
     1975, 12, 231-240.

Samejima, F. Graded response model of the latent trait theory and
     tailored testing. Proceedings of the First Conference on
     Computerized Adaptive Testing. United States Civil Service
     Commission, Bureau of Policies and Standards, 1976.

Samejima, F. Estimating latent ability using a response pattern
     of graded responses. Psychometrika, 1969, Monograph Supplement
     No. 17.

Shoemaker, D.M. Toward a framework for achievement testing.
     Review of Educational Research, 1975, 45, 127-148.

Wood, R. Computerized adaptive sequential testing. Unpublished
     doctoral dissertation, University of Chicago, 1971.






